Accessing information about government welfare schemes is difficult because there is no central repository of information, schemes are documented in only a few languages, and personalized guidance is unavailable. Existing approaches typically require users to navigate difficult interfaces or to pose queries through a complex process. This paper presents a multilingual retrieval-augmented generation (RAG) assistant that gives efficient access to information about government welfare schemes. Our solution is built on a dataset of more than 3,400 government schemes and lets users communicate with the system in several languages through a unified interface. We employ a RAG pipeline to retrieve scheme information and generate appropriate responses to user queries. We evaluate the system through a series of experiments using predefined queries with relevance scores. The experiments show that improving the retrieval stage improves the ranking performance of the system. Additionally, an IVR-based interface is used to make the system accessible to the rural population.
Introduction
This paper describes a multilingual AI-powered government scheme assistant designed to make it easier for citizens, especially those in rural and semi-urban areas, to access and understand government welfare schemes.
Information about schemes is currently scattered across multiple portals, is often available only in a limited number of languages, and is difficult for users to navigate. Additionally, existing systems do not effectively match schemes to users based on personal eligibility factors such as income, education, or location.
To solve this, the paper proposes a Retrieval-Augmented Generation (RAG)-based system that allows users to ask questions via text or voice in multiple languages. The system retrieves relevant scheme information from a structured dataset and uses a language model to generate clear, context-aware answers. It also includes an Interactive Voice Response (IVR) system to support users with low literacy or limited digital access.
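The retrieve-then-generate flow described above can be illustrated with a minimal sketch. The prompt wording, the function names `build_prompt` and `answer`, and the `retrieve` and `generate` callables are all hypothetical placeholders introduced here for illustration; the paper does not specify its prompt or its model-calling code.

```python
from typing import Callable, List


def build_prompt(question: str, passages: List[str],
                 answer_language: str = "English") -> str:
    """Assemble a grounded prompt from the retrieved scheme passages.

    The wording of the instructions is an assumption, not the paper's prompt.
    """
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the scheme information below. "
        "If the information is insufficient, say so.\n"
        f"Respond in {answer_language}.\n\n"
        f"Scheme information:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question: str,
           retrieve: Callable[[str, int], List[str]],
           generate: Callable[[str], str],
           top_k: int = 3,
           answer_language: str = "English") -> str:
    """Retrieve the top_k passages, then ask the generator for a response.

    `retrieve` and `generate` are supplied by the caller (hypothetical hooks
    standing in for the retriever and the language model).
    """
    passages = retrieve(question, top_k)
    return generate(build_prompt(question, passages, answer_language))
```

Keeping the retriever and the generator as injected callables means the same flow could serve both the text interface and the IVR interface, with speech conversion handled outside this function.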
A key contribution is a large structured dataset of more than 3,400 government schemes, built by scraping scheme information from the web and cleaning it. The paper also evaluates different document chunking strategies, combining field-aware, word-based, and token-based methods to improve retrieval accuracy.
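One way to combine field-aware and word-based chunking is sketched below. The field names (`name`, `eligibility`, `benefits`), the window size, the overlap, and the chunk text format are assumptions made for illustration; the dataset's actual schema and the paper's chunking parameters are not given here.

```python
from typing import Dict, List


def split_words(text: str, max_words: int = 50, overlap: int = 10) -> List[str]:
    """Word-based splitting: fixed-size windows that share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return windows


def field_aware_chunks(scheme: Dict[str, str], max_words: int = 50,
                       overlap: int = 10) -> List[Dict[str, str]]:
    """Chunk each field separately and tag every chunk with its origin.

    Prefixing the scheme name and field label keeps a chunk interpretable
    even when it is retrieved without its neighbours.
    """
    name = scheme.get("name", "")
    chunks = []
    for field, value in scheme.items():
        if field == "name" or not value.strip():
            continue  # skip the title itself and empty fields
        for piece in split_words(value, max_words, overlap):
            chunks.append({
                "scheme": name,
                "field": field,
                "text": f"{name} | {field}: {piece}",
            })
    return chunks
```

A token-based variant would follow the same structure but count tokenizer tokens instead of whitespace-separated words, which aligns chunk sizes with the input limit of the embedding model.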
For searching, the system uses a hybrid retrieval approach, combining semantic search (dense embeddings using models like MiniLM) with traditional keyword-based BM25 scoring. Retrieved results are then passed to a generative model (Gemini 2.5 Flash Lite) to produce final responses in text or speech.
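The sketch below illustrates one common way to fuse dense and BM25 scores: min-max normalization followed by a weighted sum. The fusion rule, the weight `alpha`, the BM25 parameters `k1` and `b`, and all function names are assumptions; the text above does not state how the two scores are combined. Embeddings are passed in as plain vectors so the example stays self-contained; in practice they would come from a sentence-embedding model such as MiniLM.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query: str, docs: Sequence[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each document for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query: str, query_vec: Sequence[float], docs: Sequence[str],
                doc_vecs: Sequence[Sequence[float]], alpha: float = 0.5) -> List[int]:
    """Return document indices ordered by alpha*dense + (1-alpha)*BM25."""
    sparse = min_max(bm25_scores(query, docs))
    dense = min_max([cosine(query_vec, v) for v in doc_vecs])
    fused = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
```

Setting `alpha` to 1.0 reduces this to purely semantic ranking and 0.0 to purely keyword ranking, which makes it straightforward to compare the three configurations on the same query set.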
The literature review shows that while RAG systems, hybrid retrieval methods, and multilingual NLP have advanced significantly, most existing work focuses on either retrieval performance or the user interface, rarely on both together. There is also limited research specifically targeting structured government scheme data.
Conclusion
This paper proposed a multilingual RAG assistant that improves access to information about government welfare schemes. The system addresses fragmented sources, linguistic diversity, and the lack of personalized assistance by combining semantic search with response generation. A structured dataset of more than 3,400 government schemes was created to support efficient retrieval. The experiments assess the effect of different chunking and retrieval approaches on this data and show that field-aware data representation, together with hybrid retrieval, considerably improves retrieval effectiveness. This suggests that tailoring retrieval methods to structured multi-field data is crucial.
Beyond optimizing retrieval, the system supports multilingual queries and an IVR interface to increase accessibility for users in rural or less digitally connected areas. By integrating retrieval techniques with user-oriented design, the solution offers a viable way to deliver scheme information to these users.